ABSTRACT

Learned feature representations and sub-phoneme posteriors
from Deep Neural Networks (DNNs) have been used separately
to produce significant performance gains for speaker and
language recognition tasks. In this work we show how these
gains are possible using a single DNN for both speaker and
language recognition. The unified DNN approach is shown to
yield substantial performance improvements on the 2013
Domain Adaptation Challenge speaker recognition task (55%
reduction in EER for the out-of-domain condition) and on the
NIST 2011 Language Recognition Evaluation (48% reduction in
EER for the 30s test condition).

Index Terms: i-vector, DNN, bottleneck features, speaker 
recognition, language recognition 

1. INTRODUCTION 

The impressive gains in performance obtained using deep
neural networks (DNNs) for automatic speech recognition
(ASR) [1] have motivated the application of DNNs to other
speech technologies such as speaker recognition (SR) and
language recognition (LR) [2, 3, 4, 5, 6, 7, 8, 9]. Two
general methods of applying DNNs to the SR and LR tasks have
been shown to be effective. The first or “direct” method uses
a DNN trained as a classifier for the intended recognition
task. In the direct method the DNN is trained to discriminate
between speakers for SR [?] or languages for LR [?]. The
second or “indirect” method uses a DNN trained for a
different purpose to extract data that is then used to train a
secondary classifier for the intended recognition task.
Applications of the indirect method have used a DNN trained
for ASR to extract frame-level features [?], accumulate a
multinomial vector [?] or accumulate multi-modal statistics
[?] that were then used to train an i-vector system [?].

The unified DNN approach described in this work uses two of
the indirect methods described above. The first indirect
method (“bottleneck”) uses frame-level features extracted
from a DNN with a special bottleneck layer [15], and the
second indirect method (“DNN-posterior”) uses posteriors
extracted from a DNN to accumulate multi-modal statistics
[6]. The features and the statistics from both indirect
methods are then used to train four different i-vector
systems: one for each task (SR and LR) and each method
(bottleneck and DNN-posterior). A key point in the unified
approach is that a single DNN is used for all four of these
i-vector systems. Additionally, we will examine the
feasibility of using a single i-vector extractor for both SR
and LR.

2. I-VECTOR CLASSIFIER FOR SR AND LR 

Over the past 5 years, state-of-the-art SR and LR performance
has been achieved using i-vector based systems [?]. In
addition to using an i-vector classifier as a baseline
approach for our experiments, we will also show how
phonetic-knowledge-rich DNN feature representations and
posteriors can be incorporated into the i-vector classifier
framework, providing significant performance improvements. In
this section we provide a high-level description of the
i-vector approach (for a detailed description see, for
example, [?]).

In Figure 1 we show a simplified block diagram of i-vector
extraction and scoring. An audio segment is first processed
to find the locations of speech in the audio (speech activity
detection) and to extract acoustic features that convey
speaker/language information. Typically, 20 dimensional
mel-frequency cepstral coefficients (MFCCs) and their
derivatives are used for SR, and 56 dimensional static cepstra
plus shifted-delta cepstra (SDC) are used for LR, both
analyzed at 100 feature vectors/second. Using a Universal
Background Model (UBM), essentially a
speaker/language-independent Gaussian mixture model (GMM),
the per-mixture posterior probability of each feature vector
(“GMM-posterior”) is computed and used, along with the
feature vectors in the segment, to accumulate zeroth, first,
and second order sufficient statistics (SS). These SSs are
then transformed into a low dimensional i-vector
representation (typically 400-600 dimensions) using a total
variability matrix, T. The i-vector is whitened by
subtracting a global mean, m, and scaling by the inverse
square root of a global covariance matrix, W, and is then
normalized to unit length [?]. Finally, a score between a
model and test i-vector is computed. The simplest scoring
function is the cosine distance between the i-vector
representing a speaker/language model (average of i-vectors
from the speaker’s/language’s training segments) and the
i-vector representing the test segment. The current
state-of-the-art scoring function, called Probabilistic
Linear Discriminant Analysis (PLDA) [14], requires a
within-class matrix $\Sigma_{wc}$, characterizing how
i-vectors from a single speaker/language vary, and an
across-class matrix $\Sigma_{ac}$, characterizing how
i-vectors between different speakers/languages vary.

Fig. 1. Simplified block diagram of i-vector extraction and
scoring.

Fig. 2. Example DNN architecture
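
The whitening, length normalization and cosine scoring steps
described above can be sketched in a few lines of Python
with NumPy. This is an illustrative sketch only, not the
implementation used in this work: the function names
(`whiten_and_length_normalize`, `enroll`, `cosine_score`) are
hypothetical, and W is assumed to be a symmetric
positive-definite covariance matrix. PLDA scoring is not
shown.

```python
import numpy as np

def whiten_and_length_normalize(ivec, m, W):
    """Subtract the global mean m, scale by W^(-1/2), then normalize to unit length."""
    # Inverse square root of W via its eigendecomposition
    # (assumes W is symmetric positive definite).
    eigvals, eigvecs = np.linalg.eigh(W)
    W_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    x = W_inv_sqrt @ (ivec - m)
    return x / np.linalg.norm(x)

def enroll(train_ivecs):
    """Model i-vector: the average of the training-segment i-vectors."""
    return np.mean(train_ivecs, axis=0)

def cosine_score(model_ivec, test_ivec):
    """Cosine similarity between a model and a test i-vector (higher is closer)."""
    denom = np.linalg.norm(model_ivec) * np.linalg.norm(test_ivec)
    return float(model_ivec @ test_ivec / denom)
```

In this sketch every i-vector would pass through
`whiten_and_length_normalize` before enrollment or scoring.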

Collectively, the UBM, T, W, m, $\Sigma_{wc}$, and
$\Sigma_{ac}$ are known as the system’s hyper-parameters and
must be estimated before a system can enroll and/or score any
data. The UBM, T, W, and m represent general feature
distributions and total variance of statistics and i-vectors,
so unlabeled data from the desired audio domain (e.g.,
telephone or microphone) can be used to estimate them. The
$\Sigma_{wc}$ and $\Sigma_{ac}$ matrices, however, each
require a large collection of labeled data for training. For
SR, $\Sigma_{wc}$ and $\Sigma_{ac}$ typically require
thousands of speakers, each of whom contributes tens of
samples to the data set. For LR, the enrollment samples from
each desired language, which typically comprise hundreds of
samples from many different speakers, can be used to estimate
$\Sigma_{wc}$ and $\Sigma_{ac}$.

By far the most computationally expensive part of an
i-vector system is extracting the i-vectors themselves. An
efficient approach for performing both SR and LR on the same
data is to use the same i-vectors. This may be possible if
both systems use the same feature extraction, UBM, and T
matrices. There may be some tradeoff in performance, however,
since the UBM, T matrix, and signal processing will not be
specialized for SR or LR.


3. DEEP NEURAL NETWORK CLASSIFIER FOR
SPEECH APPLICATIONS

3.1. DNN architecture 

A DNN, like a multi-layer perceptron (MLP), consists of an
input layer, several hidden layers and an output layer. Each
layer has a fixed number of nodes and each sequential pair of
layers is fully connected with a weight matrix. The
activations of nodes on a given layer are computed by
transforming the output of the previous layer with the weight
matrix: $a^{(i)} = M^{(i)} x^{(i-1)}$. The output of a given
layer is then computed by applying an “activation function”,
$x^{(i)} = h^{(i)}(a^{(i)})$ (see Figure 2). Commonly used
activation functions include the sigmoid, the hyperbolic
tangent, rectified linear units and even a simple linear
transformation. Note that if all the activation functions in
the network are linear then the stacked matrices reduce to a
single matrix multiply.
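
As an illustration of the layer computation just described,
the following NumPy sketch propagates an input through a
stack of fully connected layers. The names `forward`,
`sigmoid` and `identity`, and the use of M for a weight
matrix, are choices made for this example; bias terms are
omitted because the text describes only a weight matrix.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def identity(a):
    """Linear activation function."""
    return a

def forward(x, weights, activations):
    """Propagate input x through the layers and return every layer's output."""
    outputs = [x]
    for M, h in zip(weights, activations):
        a = M @ outputs[-1]    # activations from the previous layer's output
        outputs.append(h(a))   # layer output after the activation function
    return outputs
```

With `identity` used for every layer, `forward(x, [M1, M2],
[identity, identity])[-1]` equals `(M2 @ M1) @ x`, which is
the collapse to a single matrix multiply noted above.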

The type of activation function used for the output layer
depends on what the DNN is used for. If the DNN is trained
for regression, the output activation function is linear and
the objective function is the mean squared error between the
output and some target data. If the DNN is trained as a
classifier, then the output activation function is the
soft-max and the objective function is the cross entropy
between the output and the true class labels. For a
classifier, each output node of the DNN corresponds to a
class and the output is an estimate of the posterior
probability of the class given the input data.
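
The two output configurations above can be made concrete
with a short NumPy sketch. The function names are
hypothetical, and class labels are assumed to be given as
integer indices.

```python
import numpy as np

def softmax(a):
    """Soft-max over the last axis, shifted by the maximum for numerical stability."""
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(posteriors, labels):
    """Mean negative log posterior of the true class (classifier objective)."""
    n = posteriors.shape[0]
    return float(-np.mean(np.log(posteriors[np.arange(n), labels])))

def mean_squared_error(outputs, targets):
    """Mean squared error between outputs and targets (regression objective)."""
    return float(np.mean((outputs - targets) ** 2))
```

Each row returned by `softmax` sums to one, so it can be read
as a set of class posterior estimates for one input frame.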

3.2. DNN Training for ASR 

DNN classifiers can be used as acoustic models in ASR
systems to compute the posterior probability of a
sub-phonetic unit (a “senone”) given an acoustic observation.
Observations, or feature vectors, are extracted from speech
data at a fixed sample rate using a spectral technique such
as filterbank analysis, MFCC, or perceptual linear prediction
(PLP) coefficients. Decoding is performed using a hidden
Markov model (HMM) and the DNN to find the most likely
sequence of senones given the feature vectors (this requires
using Bayes’ rule to convert the DNN posteriors to
likelihoods). Training the DNN requires a significant amount
of manually transcribed speech data [?]. The senone labels
are derived from the transcriptions using a phonetic
dictionary and a state-of-the-art GMM/HMM ASR system.
Generally speaking, a refined set of phonotactic units
aligned using a high performing ASR system is required to
train a high performing DNN system [?].
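
The Bayes’ rule conversion mentioned above can be written
out explicitly. The identity below is the standard hybrid
DNN/HMM relation rather than a formula taken from this work;
the symbols $x_t$ (feature vector at frame $t$) and $s$
(senone) are introduced here for illustration, and estimating
the prior $P(s)$ from senone counts in the training
alignments is an assumed, commonly used choice.

```latex
p(x_t \mid s) \;=\; \frac{P(s \mid x_t)\, p(x_t)}{P(s)}
\;\propto\; \frac{P(s \mid x_t)}{P(s)}
```

Because $p(x_t)$ does not depend on $s$, it can be dropped
when comparing senone sequences during decoding.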

DNN training is essentially the same as traditional MLP
training. The most common approach uses stochastic gradient
descent (SGD) with a mini-batch for updating the DNN
parameters throughout a training pass or “epoch”. The
backpropagation algorithm is used to estimate the gradient of
the objective function with respect to the DNN parameters for
each mini-batch. Initializing the DNN is critical, but it has
been shown that a random initialization is adequate for
speech applications where there is a substantial amount of
data [?]. A held-out validation data set is used to estimate
the error rate after each training epoch. The SGD algorithm
uses a heuristic learning rate parameter that is adjusted in
accordance with a scheduling algorithm which monitors the
validation error rate at each epoch. Training ceases when the
error rate can no longer be reduced.
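
The scheduling algorithm is not specified further in this
work, so the following Python sketch shows only one plausible
scheme, under stated assumptions: the learning rate is
multiplied by `decay` whenever the validation error improves
by less than `min_gain`, and training stops as soon as the
validation error fails to decrease. The function name, the
callables `run_epoch` and `validation_error`, and all default
values are hypothetical.

```python
def train_with_schedule(run_epoch, validation_error, lr=0.1, decay=0.5,
                        min_gain=1e-3, max_epochs=50):
    """Run SGD epochs, adapting the learning rate from the validation error.

    run_epoch(lr) performs one pass over the mini-batches;
    validation_error() returns the error on the held-out set.
    Returns a list of (epoch, learning_rate, validation_error) tuples.
    """
    best = float("inf")
    history = []
    for epoch in range(max_epochs):
        run_epoch(lr)
        err = validation_error()
        history.append((epoch, lr, err))
        if err >= best:
            break                # error can no longer be reduced: stop
        if best - err < min_gain:
            lr *= decay          # only a small gain: reduce the learning rate
        best = err
    return history
```

Passing the training pass and the validation measurement as
callables keeps the schedule independent of how the gradient
updates themselves are implemented.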

In the past, training neural networks with more than 2
hidden layers proved to be problematic. Recent advances in
fast and affordable computing hardware, optimization software
and initialization techniques have made it possible to train
much deeper networks. A typical DNN for ASR will have 5 or
more hidden layers, each with the same number of nodes,
typically between 500 and 3,000 [?]. The number of output
senones varies from a few hundred to tens of thousands [?].

3.3. DNN bottleneck features 

A DNN can also be used as a means of extracting features for
use by a secondary classifier, including another DNN [16].
This is accomplished by sampling the activation of one of the
DNN’s hidden layers and using this as a feature vector. For
some classifiers the dimensionality of the hidden layer is
too high and some sort of feature reduction, like LDA or PCA,
is necessary. In [?], a dimension reducing linear
transformation is optimized as part of the DNN training by
using a special bottleneck hidden layer that has fewer nodes
(see Figure 2). The bottleneck layer uses a linear activation
so that it behaves very much like an LDA or PCA
transformation on the activation of the previous layer. The
bottleneck DNN used in this work is the same system described
in [15]. In theory any layer can be used as a bottleneck
layer, but in our work we have chosen to use the second to
last layer with the hope that the output posterior prediction
will not be too adversely affected by the loss of information
at the bottleneck.
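
Bottleneck feature extraction as described above amounts to
running the network forward and stopping at the bottleneck
layer. The NumPy sketch below illustrates this under
simplifying assumptions: weights are given as a list of
matrices without bias terms, every layer before the
bottleneck uses a sigmoid, and the function name
`bottleneck_features` and the 0-based `bottleneck_index` are
hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def bottleneck_features(frames, weights, bottleneck_index):
    """Return the linear activation of the bottleneck layer for each frame.

    frames:  array of shape (num_frames, input_dim)
    weights: list of matrices of shape (out_dim, in_dim), one per layer
    """
    x = frames.T
    for i, M in enumerate(weights):
        a = M @ x
        if i == bottleneck_index:
            return a.T       # linear bottleneck: the activation is the feature
        x = sigmoid(a)       # earlier hidden layers use a sigmoid
    raise ValueError("bottleneck_index is beyond the last layer")
```

The layers after the bottleneck are needed only for training
the DNN; they are never evaluated when extracting features.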

3.4. DNN stats extraction for an i-vector system 

A typical i-vector system uses zeroth, first and second order
statistics generated using a GMM. Statistics are accumulated
by first estimating the posterior of each GMM component
density for a frame (the “occupancy”) and using these
posteriors as weights for accumulating the statistics for
each component of the mixture distribution. The zeroth order
statistics are the total occupancies for an utterance across
all GMM components and the first order statistics are the
occupancy-weighted sums of the feature vectors for each
component. The i-vector is then computed using a dimension
reducing transformation that is non-linear with respect to
the zeroth order statistics.

An alternate approach to extracting statistics has been
proposed in [?]. Statistics are accumulated in the same way
as for the GMM but class posteriors from the DNN are used in
place of GMM component posteriors. Once the statistics have
been accumulated, the i-vector extraction is performed in the
same way as it is from the GMM based statistics. This
approach has been shown to give significant gains for both SR
and LR [?].
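
The statistics accumulation described in this section is the
same computation whether the posteriors come from a GMM or
from the DNN output layer, as the following NumPy sketch
illustrates. The function name `accumulate_stats` is
hypothetical, and the second order statistics are assumed to
be diagonal (per-dimension sums of squares).

```python
import numpy as np

def accumulate_stats(features, posteriors):
    """Accumulate sufficient statistics for one utterance.

    features:   (T, D) feature vectors
    posteriors: (T, C) per-frame class posteriors (GMM components or senones)
    Returns N (C,), F (C, D) and S (C, D).
    """
    N = posteriors.sum(axis=0)            # zeroth order: total occupancy per class
    F = posteriors.T @ features           # first order: posterior-weighted feature sums
    S = posteriors.T @ (features ** 2)    # second order (diagonal): weighted squares
    return N, F, S
```

Swapping the GMM-posterior system for the DNN-posterior
system changes only the `posteriors` argument; the features
and the downstream i-vector extraction are unaffected.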

4. EXPERIMENT SETUP 

4.1. Corpora 

Three different corpora are used in our experiments. The
DNN itself is trained using a 100 hour subset of Switchboard
1 [?]. The 100 hour Switchboard subset is defined in the
example system distributed with Kaldi [?]. The SR systems
were trained and evaluated using the 2013 Domain Adaptation
Challenge (DAC13) data [20]. The LR systems were evaluated on
the NIST 2011 Language Recognition Evaluation (LRE11) data
[21]. Details on the LR training and development data can be
found in [22].

4.2. System configuration 

4.2.1. Commonalities 

All systems use the same speech activity segmentation
generated using a GMM based speech activity detector (GMM
SAD). The i-vector system uses MAP and PPCA to estimate the T
matrix. Scoring is performed using PLDA [14]. With the
exception of the input features or multi-modal statistics,
the i-vector systems are identical and use a 2048 component
GMM UBM and a 600 dimensional i-vector subspace. All LR
systems use the discriminative backend described in [22].

4.2.2. Baseline systems 

The front-end feature extraction for the baseline LR system
uses 7 static cepstra appended with 49 SDC. Unlike the
front-end described in [22], vocal tract length normalization
(VTLN) and feature domain nuisance attribute projection
(fNAP) are not used. The front-end for the baseline SR system
uses 20 MFCCs including C0 and their first derivatives for a
total of 40 features.

4.2.3. DNN system 

The DNN was trained using 4,199 state cluster (“senone”)
target labels generated using the Kaldi Switchboard 1 “tri4a”
example system [?]. The DNN front-end uses 13 Gaussianized
PLP coefficients and their first and second order derivatives
(39 features) stacked over a 21 frame window (10 frames to
either side of the center frame) for a total of 819 input
features. The GMM SAD segmentation is applied to the stacked
features.
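
The input dimensionality follows from the stacking: 39
features over 21 frames gives 39 x 21 = 819. A minimal NumPy
sketch of such frame stacking is shown below. The function
name `stack_frames` is hypothetical, and repeating the first
and last frames at the utterance edges is an assumption; the
edge handling is not described in this work.

```python
import numpy as np

def stack_frames(features, context=10):
    """Stack each frame with `context` frames on either side.

    features: (T, D) array; returns (T, (2 * context + 1) * D).
    Edge frames are repeated to fill the window at utterance boundaries.
    """
    T = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Block k holds the frame at offset (k - context) from the center frame.
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])
```

For a (T, 39) input, `stack_frames(features, context=10)`
returns a (T, 819) array whose middle 39 columns are the
original center frames.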

The DNN has 7 hidden layers of 1024 nodes each, with the
exception of the 6th (bottleneck) layer, which has 64 nodes.
All hidden layers use a sigmoid activation function, with the
exception of the 6th layer, which is linear [15]. The DNN
training is performed on an NVIDIA Tesla K40 GPU using custom
software developed at MIT/CSAIL.
Features    Posteriors  EER (%)  DCF*1000
MFCC        GMM         2.71     0.404
MFCC        DNN         2.27     0.336
Bottleneck  GMM         2.00     0.269
Bottleneck  DNN         2.79     0.388


Table 1. In-domain DAC13 results 
Features    Posteriors  EER (%)  DCF*1000
MFCC        GMM         6.18     0.642
MFCC        DNN         3.27     0.427
Bottleneck  GMM         2.79     0.342
Bottleneck  DNN         3.97     0.454


Table 2. Out-of-domain DAC13 results 


5. EXPERIMENT RESULTS 

5.1. Speaker recognition experiments 

Two sets of experiments were run on the DAC13 corpus:
“in-domain” and “out-of-domain”. For both sets of
experiments, the UBM and T hyper-parameters are trained on
Switchboard (SWB) data. The other hyper-parameters (W, m,
$\Sigma_{wc}$ and $\Sigma_{ac}$) are trained on 2004-2008
speaker recognition evaluation (SRE) data for the in-domain
experiments and SWB data for the out-of-domain experiments
(see [20] for more details). Tables 1 and 2 summarize the
results for the in-domain and out-of-domain experiments, with
the first row of each table corresponding to the baseline
system. While the DNN-posterior technique with MFCCs gives a
significant gain over the baseline system for both sets of
experiments, as also reported in [?] and [?], an even greater
gain is realized using bottleneck features with a GMM.
Unfortunately, using both bottleneck features and
DNN-posteriors degrades performance.

5.2. Language recognition experiments 

The experiments run on the LRE11 task are summarized in
Table 3, with the first row corresponding to the baseline
system and the last row corresponding to a fusion of 5
“post-evaluation” systems (see [22] for details). Bottleneck
features with GMM posteriors outperform the other system
configurations, including the 5 system fusion. Interestingly,
bottleneck features with DNN-posteriors show more of an
improvement over the baseline system than in the speaker
recognition experiments.
Features      Posteriors  30s    10s    3s
SDC           GMM         5.26   10.7   20.9
SDC           DNN         4.00   8.21   19.5
Bottleneck    GMM         2.76   6.55   15.9
Bottleneck    DNN         3.79   7.71   18.2
5-way fusion              3.27   6.67   17.1


Table 3. LRE11 results (Cavg)


UBM/T   DAC13 in-domain          LRE11 30s
DAC13   2.00% EER / 0.269 DCF    6.12 Cavg
LRE11   2.68% EER / 0.368 DCF    2.76 Cavg


Table 4. Cross-task DNN-bottleneck feature i-vector systems

5.3. Cross-task i-vector Extraction 

Table 4 shows the performance on the DAC13 and LRE11 tasks
when extracting i-vectors using parameters from one of the
two systems. As expected, there is a degradation in
performance for the mis-matched task, but the degradation is
less on the DAC13 SR task using the LRE11 LR
hyper-parameters. These results motivate further research in
developing a unified i-vector extraction system for both SR
and LR by careful UBM/T training data selection.

6. CONCLUSIONS 

This paper has presented a DNN bottleneck feature extractor
that is effective for both speaker and language recognition
and produces significant performance gains over
state-of-the-art MFCC/SDC i-vector approaches as well as more
recent DNN-posterior approaches. For the speaker recognition
DAC13 task, the new DNN bottleneck features decreased
in-domain EER by 26% and DCF by 33% and out-of-domain EER by
55% and DCF by 47%. The out-of-domain results are
particularly interesting since no in-domain data was used for
DNN training or hyper-parameter adaptation. On LRE11, the
same bottleneck features decreased EERs at 30s, 10s, and 3s
test durations by 48%, 39%, and 24%, respectively, and even
outperformed a 5 system fusion of acoustic and phonetic based
recognizers. A final set of experiments demonstrated that it
may be possible to use a common i-vector extractor for a
unified speaker and language recognition system. Although not
presented here, it was also observed that recognizers using
the new DNN bottleneck features produced much better
calibrated scores as measured by CLLR metrics.
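
The relative reductions quoted above follow directly from
the baseline and bottleneck/GMM rows of Tables 1 to 3. The
short Python check below reproduces the arithmetic; the
function name `relative_reduction` and the dictionary keys
are hypothetical, while the numbers are copied from the
tables.

```python
def relative_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

# (baseline, bottleneck features with GMM posteriors) from Tables 1-3
results = {
    "DAC13 in-domain EER": (2.71, 2.00),
    "DAC13 in-domain DCF": (0.404, 0.269),
    "DAC13 out-of-domain EER": (6.18, 2.79),
    "DAC13 out-of-domain DCF": (0.642, 0.342),
    "LRE11 30s": (5.26, 2.76),
    "LRE11 10s": (10.7, 6.55),
    "LRE11 3s": (20.9, 15.9),
}

reductions = {name: round(relative_reduction(b, n))
              for name, (b, n) in results.items()}
```

For example, the out-of-domain EER reduction is
(6.18 - 2.79) / 6.18, or about 55%.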

The DNN bottleneck features, in essence, are the learned
feature representation from which the DNN posteriors are
derived. Experimentally, it appears that using the learned
feature representation is better than using just the output
posteriors with SR or LR features, but combining the DNN
bottleneck features and DNN posteriors degrades performance.
This may be because we are able to train a better suited
posterior estimator (UBM) with data more matched to the task
data. Since we are working with new features, future research
will examine whether there are more effective classifiers to
apply than i-vectors. Other future research will explore the
sensitivity of the bottleneck features to the DNN’s
configuration, and training data quality and quantity.